Fast probabilistic analysis of sequence function using scoring matrices

Authors

  • Thomas D. Wu
  • Craig G. Nevill-Manning
  • Douglas L. Brutlag
Abstract

MOTIVATION: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.

RESULTS: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62% to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12,177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10^-6, and 541 residues/s for a p threshold of 10^-20. In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99.5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 × 10^-5.

AVAILABILITY: The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.
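The three techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the EMATRIX implementation: the function names, the integer-valued toy matrix, and the uniform background frequencies are all assumptions made for illustration. The score distribution is computed under the independence assumption only, and the ordering rule in `permuted_order` (largest expected shortfall from the column maximum first) is a stand-in heuristic, since the abstract does not state the ordering criterion the authors use.

```python
from collections import defaultdict

def score_distribution(matrix, background):
    """Exact distribution of segment scores, assuming independent residues.
    matrix: one dict per position mapping residue -> integer score.
    background: dict mapping residue -> probability (hypothetical values)."""
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for score, p in dist.items():
            for residue, q in background.items():
                nxt[score + column[residue]] += p * q
        dist = dict(nxt)
    return dist

def score_threshold(dist, p_threshold):
    """Probability filtering: lowest score T with P(score >= T) <= p_threshold."""
    tail = 0.0
    threshold = max(dist) + 1  # unreachable score: nothing passes by default
    for score in sorted(dist, reverse=True):
        if tail + dist[score] > p_threshold:
            break
        tail += dist[score]
        threshold = score
    return threshold

def permuted_order(matrix, background):
    """Assumed heuristic: visit first the positions where a random residue
    is expected to fall furthest below the column maximum."""
    def expected_loss(column):
        mean = sum(q * column[r] for r, q in background.items())
        return max(column.values()) - mean
    return sorted(range(len(matrix)),
                  key=lambda i: expected_loss(matrix[i]), reverse=True)

def lookahead_score(matrix, segment, threshold, order=None):
    """Lookahead scoring: stop as soon as the threshold is out of reach.
    Returns (score, residues_examined); score is None on early termination."""
    order = list(range(len(matrix))) if order is None else order
    # max_remaining[k] = best score still obtainable from positions order[k:]
    max_remaining = [0] * (len(order) + 1)
    for k in range(len(order) - 1, -1, -1):
        max_remaining[k] = max_remaining[k + 1] + max(matrix[order[k]].values())
    score = 0
    for k, pos in enumerate(order):
        score += matrix[pos][segment[pos]]
        if score + max_remaining[k + 1] < threshold:
            return None, k + 1
    return score, len(order)

# Toy two-letter alphabet and three-column matrix (illustrative only).
background = {"A": 0.5, "B": 0.5}
matrix = [{"A": 1, "B": 0}, {"A": 5, "B": -5}, {"A": 2, "B": 0}]
threshold = score_threshold(score_distribution(matrix, background), 0.25)
order = permuted_order(matrix, background)
print(threshold, order)
print(lookahead_score(matrix, "ABA", threshold))         # natural order
print(lookahead_score(matrix, "ABA", threshold, order))  # permuted order
```

In this toy case the permuted order rejects the segment after fewer residues than the natural left-to-right order, which is the effect the abstract attributes to permuted lookahead scoring. A Markov-based quantile function would need a distribution that tracks the preceding residues as well as the running score.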


Similar resources

Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices

Scoring matrices for nucleic acid sequence comparison that are based on models appropriate to the analysis of molecular sequencing errors or biological mutation processes are presented. In mammalian genomes, transition mutations occur significantly more frequently than transversions, and the optimal scoring of sequence alignments based on this substitution model differs from that derived assumi...


Comparative Study of Random Matrices Capability in Uncertainty Detection of Pier’s Dynamics

Because of the random nature of many dependent variables in coastal engineering, the treatment of effective parameters is generally associated with uncertainty. Numerical models are often used for dynamic analysis of complex structures, including mechanical systems. Furthermore, deterministic models are not sufficient for exact anticipation of a structure’s dynamic response, but probabilistic models...


Finding Winner Alignments with Multiple Scoring Matrices

How to align two given sequences properly is a fundamental problem in bioinformatics. In the sequence alignment problem, the most essential thing that directly affects the resulting alignment is the scoring matrix. There are a variety of scoring matrices used for alignment, and each of them has its own purpose in biosequence alignment. It seems unlikely that an alignment is optimal for each sco...


Numerical solution of general nonlinear Fredholm-Volterra integral equations using Chebyshev approximation

A numerical method for solving nonlinear Fredholm-Volterra integral equations of general type is presented. This method is based on replacement of the unknown function by a truncated series of the well-known Chebyshev expansion of functions. The quadrature formulas which we use to calculate integral terms have been imated by Fast Fourier Transform (FFT). This is a great advantage of this method which has...


On the fine spectrum of generalized upper triangular double-band matrices $\Delta^{uv}$ over the sequence spaces $c_0$ and $c$

The main purpose of this paper is to determine the fine spectrum of the generalized upper triangular double-band matrices $\Delta^{uv}$ over the sequence spaces $c_0$ and $c$. These results are more general than the spectrum of upper triangular double-band matrices of Karakaya and Altun [V. Karakaya, M. Altun, Fine spectra of upper triangular double-band matrices, Journal of Computational and Applied Mathematics...



Journal:
  • Bioinformatics

Volume 16, Issue 3

Pages -

Publication date: 2000